Comparing common and rare single variant and gene aggregate instrumentation strategies for MR
Author
Aimee Hanson
Published
May 19, 2025
Introduction
Classically, Mendelian Randomisation (MR) methods using trait-associated genetic variants from GWAS have employed common polymorphisms (population MAF > 1%), either targeted by genotyping chips or reliably imputed from reference populations, to instrument modifiable exposures. However, common variants tested in GWAS typically explain a very small fraction of the variability in a measured complex trait, may exhibit pleiotropic effects (whether acted upon by balancing selection or arising as a consequence of genetic linkage), and are rarely causal. Rare variants, which typically show large biological effects (e.g. through abolishing protein expression), provide a means of instrumenting relevant molecular processes less ambiguously. Comparing causal estimates derived from different methods of genetically instrumenting modifiable exposures may enhance the interpretation of the biological mechanisms underlying exposure-outcome relationships. This includes using variants from across the allele frequency spectrum, but also leveraging rare variant aggregate approaches to instrument gene-level perturbations in expression and function.
Causal estimates for pairwise combinations of the exposure and outcome relationships below have been derived using twelve instrumentation strategies.
Exposures: Low density lipoprotein levels (LDL direct), Body Mass Index (BMI), Vitamin D, Triglycerides, Glycated Haemoglobin (HbA1c), Mean Platelet Volume (MPV), IGF-1, Waist-to-Hip Ratio (BMI-corrected), Red Blood Cell (RBC) erythrocyte count and Mean Corpuscular Volume (MCV)
Outcomes: Coronary Artery Disease (CAD), Type 2 Diabetes (T2D), Multiple Sclerosis (MS), Ischemic Stroke, Atrial Fibrillation (AF), Venous Thromboembolism (VTE), Prostate Cancer and Hypertension.
Instruments
Twelve sets of instruments for each exposure have been extracted from three sources (DeepRVAT gene impairment scores; Genebass whole exome single variants and aggregate burden masks; and UKB common variant GWAS summary statistics):
Common GWAS
Associated common variants from UKB GWAS (>1% MAF)
Genebass (variants)
Common exome-wide (>5% MAF, LD clumped)
Low-frequency exome-wide (1-5% MAF, LD clumped)
Rare exome-wide (0-1% MAF, both unfiltered and filtered to the top hit per gene)
Ultra-rare exome-wide (0-0.1% MAF, both unfiltered and filtered to the top hit per gene)
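The allele-frequency bins listed above can be sketched as a small base R example. This is only an illustration under stated assumptions: the toy table, its column names (`SNP`, `gene`, `maf`, `pval`), the handling of bin boundaries, and the use of the smallest p-value to define the "top hit per gene" are all hypothetical choices, not taken from the analysis pipeline.

```r
# Hypothetical toy variant table; values are illustrative, not real results.
variants <- data.frame(
  SNP  = c("v1", "v2", "v3", "v4", "v5"),
  gene = c("APOB", "APOB", "PCSK9", "LDLR", "LDLR"),
  maf  = c(0.20, 0.03, 0.004, 0.0005, 0.0002),
  pval = c(1e-12, 1e-9, 1e-20, 1e-8, 1e-15),
  stringsAsFactors = FALSE
)

# MAF bins as described in the text. Boundary handling (> vs >=) is an assumption.
# Note that the ultra-rare bin is a subset of the rare bin.
maf_bins <- list(
  common        = function(maf) maf > 0.05,
  low_frequency = function(maf) maf > 0.01 & maf <= 0.05,
  rare          = function(maf) maf > 0 & maf <= 0.01,
  ultra_rare    = function(maf) maf > 0 & maf <= 0.001
)

# One data frame of candidate instruments per bin
instrument_sets <- lapply(maf_bins, function(in_bin) {
  variants[in_bin(variants$maf), ]
})

# Keep the most strongly associated variant per gene (assumed definition of "top hit")
top_hit_per_gene <- function(df) {
  df <- df[order(df$pval), ]
  df[!duplicated(df$gene), ]
}

rare_top <- top_hit_per_gene(instrument_sets$rare)
```

In the real analysis the common and low-frequency bins are additionally LD clumped, which this sketch does not attempt.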
Add gene annotations for instruments (taken from variant/mask position in exome data for ExWAS studies and nearest gene for common variant GWAS studies)
Code
# Annotate DeepRVAT and burden masks with relevant gene
# ExWAS single variants are already annotated
# Annotate common variants with nearest gene based on VEP annotation:

# List of rsIDs to extract from VEP file (using VEP online interface, returning single consequence per variant --pick)
# common_instruments <- unlist(lapply(harmonised_studies, function(x){
#   return(x$opengwas_common$SNP)
# })) |> unique()
# write.table(common_instruments, file.path(data_dir,"variant_annotation","complextrait_openGWASinstruments.txt"),
#             row.names = F, col.names = F, quote = F)

# Read VEP output, keeping rows whose Location starts with a digit
vep <- data.table::fread(file.path(data_dir,"variant_annotation","vep_openGWASinstruments.txt")) |>
  dplyr::filter(grepl("^[0-9]", Location))

## Retain variant annotations for SNPs within 1kb of a protein coding gene only
## (DISTANCE == "-" marks variants that fall inside the gene)
vep_coding <- vep |>
  dplyr::filter(BIOTYPE == "protein_coding") |>
  dplyr::filter(!(as.numeric(DISTANCE) > 1000) | DISTANCE == "-")

for(i in seq_along(harmonised_studies)){
  for(j in seq_along(harmonised_studies[[i]])){
    if(names(harmonised_studies[[i]][j]) == "deeprvat_genescore"){
      # DeepRVAT instruments are named by gene
      harmonised_studies[[i]][[j]]$gene.exposure = toupper(harmonised_studies[[i]][[j]]$SNP)
    } else if(names(harmonised_studies[[i]][j]) %in% names(instrument_type[instrument_type == "mask"])){
      # Burden mask IDs carry the gene symbol before the first underscore
      harmonised_studies[[i]][[j]]$gene.exposure = toupper(gsub("_.*", "", harmonised_studies[[i]][[j]]$SNP))
    } else if(names(harmonised_studies[[i]][j]) == "opengwas_common"){
      # Common GWAS variants: look up gene symbol and consequence in the VEP table
      gene_symbols <- vep_coding[match(harmonised_studies[[i]][[j]]$SNP, vep_coding$`#Uploaded_variation`), c("SYMBOL","Consequence")] |>
        as.data.frame()
      names(gene_symbols) <- c("gene.exposure","vep.consequence")
      harmonised_studies[[i]][[j]] <- cbind(harmonised_studies[[i]][[j]], gene_symbols)
    } else {
      next
    }
  }
}
IVW estimates across instrument sets (subsetting to shared genes)
Differences in causal effect estimates across instrument sets could be due to differences in the underlying biological processes captured by the included variants. The above analysis was repeated with restriction to instruments hitting a common set of genes/gene regions across strategies:
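The shared-gene restriction described above could be sketched as follows. This is a minimal base R illustration, not the code used for the reported estimates: it assumes each instrument set is a data frame with `gene.exposure`, `beta.exposure`, `beta.outcome` and `se.outcome` columns (the `gene.exposure` column is added in the annotation step; the other names follow the usual harmonised-data convention and are an assumption here), and it uses a simple fixed-effect inverse variance weighted (IVW) estimator. The helper names `ivw_fixed` and `shared_genes` and all numbers are hypothetical.

```r
# Fixed-effect IVW estimate with first-order weights 1 / se.outcome^2
ivw_fixed <- function(dat) {
  w   <- 1 / dat$se.outcome^2
  est <- sum(w * dat$beta.exposure * dat$beta.outcome) / sum(w * dat$beta.exposure^2)
  se  <- sqrt(1 / sum(w * dat$beta.exposure^2))
  c(estimate = est, se = se, n_instruments = nrow(dat))
}

# Genes annotated in every instrument set (unannotated instruments are ignored)
shared_genes <- function(sets) {
  Reduce(intersect, lapply(sets, function(d) unique(stats::na.omit(d$gene.exposure))))
}

# Hypothetical toy instrument sets; values are illustrative, not real results.
sets <- list(
  common = data.frame(
    gene.exposure = c("APOB", "LDLR", "PCSK9"),
    beta.exposure = c(0.10, 0.20, 0.05),
    beta.outcome  = c(0.05, 0.10, 0.025),
    se.outcome    = c(0.01, 0.02, 0.01)
  ),
  rare = data.frame(
    gene.exposure = c("APOB", "PCSK9", "ABCG5"),
    beta.exposure = c(1.0, 0.8, 0.5),
    beta.outcome  = c(0.5, 0.4, 0.25),
    se.outcome    = c(0.1, 0.1, 0.2)
  )
)

genes <- shared_genes(sets)

# Re-estimate each strategy using only instruments in the shared genes
results <- lapply(sets, function(d) ivw_fixed(d[d$gene.exposure %in% genes, ]))
```

Taking the intersection across all strategies is the strictest possible definition of "shared"; with twelve strategies it may leave very few genes, so a pairwise intersection might be preferable in practice.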
Comparison of Wald Ratio (subsetting to shared genes)
There are a limited number of common and rare variants for the tested complex exposures that are annotated to the same genes (this will be a more useful analysis for molecular traits). Among the shared genes, there are several cases where rare variants within a given gene have disparate outcome effects. For example, rare variants in the APOB gene which are positively associated with LDL levels show either positive or negative effects on risk of T2D. This could potentially be due to the pleiotropic action of the APOB protein in distinct biological pathways, the relevant functional attributes of which may be differentially impacted by disruptive rare variants across the coding region…
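A per-variant Wald ratio comparison of this kind could be sketched as below. The example is a hedged illustration only: the Wald ratio is the outcome effect divided by the exposure effect, the standard error uses the common first-order approximation `se.outcome / |beta.exposure|`, and the column names, the helper names `wald_ratio` and `sign_discordant`, and the toy numbers are all assumptions made for this example rather than values from the analysis.

```r
# Per-variant Wald ratio with a first-order (delta method) standard error
wald_ratio <- function(dat) {
  dat$wald    <- dat$beta.outcome / dat$beta.exposure
  dat$wald_se <- dat$se.outcome / abs(dat$beta.exposure)
  dat
}

# Flag genes whose variants give Wald ratios of opposite sign
sign_discordant <- function(dat) {
  vapply(
    split(sign(dat$wald), dat$gene.exposure),
    function(s) length(unique(s)) > 1,
    logical(1)
  )
}

# Hypothetical toy data: two variants per gene, all raising the exposure.
toy <- data.frame(
  SNP           = c("v1", "v2", "v3", "v4"),
  gene.exposure = c("APOB", "APOB", "PCSK9", "PCSK9"),
  beta.exposure = c(0.8, 0.6, 0.5, 0.4),
  beta.outcome  = c(0.16, -0.12, 0.10, 0.08),
  se.outcome    = c(0.04, 0.03, 0.05, 0.04),
  stringsAsFactors = FALSE
)

toy <- wald_ratio(toy)
discordant <- sign_discordant(toy)
```

A sign comparison alone ignores uncertainty; in practice one would likely also check whether each ratio's confidence interval excludes zero before calling two variants discordant.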